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Abstract. It was known from Metropolis et al. [J. Chem. Phys. 21 
(1953) 1087-1092] that one can sample from a distribution by perform- 
ing Monte Carlo simulation from a Markov chain whose equilibrium 
distribution is equal to the target distribution. However, it took sev- 
eral decades before the statistical community embraced Markov chain 
Monte Carlo (MCMC) as a general computational tool in Bayesian 
inference. The usual reasons that are advanced to explain why statis- 
ticians were slow to catch on to the method include lack of computing 
power and unfamiliarity with the early dynamic Monte Carlo papers 
in the statistical physics literature. We argue that there was a deeper 
reason, namely, that the structure of problems in the statistical me- 
chanics and those in the standard statistical literature are different. 
To make the methods usable in standard Bayesian problems, one had 
to exploit the power that comes from the introduction of judiciously 
chosen auxiliary variables and collective moves. This paper examines 
the development in the critical period 1980-1990, when the ideas of 
Markov chain simulation from the statistical physics literature and the 
latent variable formulation in maximum likelihood computation (i.e., 
EM algorithm) came together to spark the widespread application of 
MCMC methods in Bayesian computation. 
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1. INTRODUCTION 

This paper surveys the historical development of 
MCMC methodology during a key time period in 
Bayesian computation. As of the mid-1980s, the Baye- 
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sian community was focused on Gaussian quadra- 
ture type methods, Laplace approximations and vari- 
ants of importance sampling as the main computa- 
tional tools in Bayesian analysis. Among more dog- 
matic Bayesians, the use of Monte Carlo was met 
with resistance and viewed as antithetical to Bayesian 
principles. MCMC techniques published in the sta- 
tistical physics and image analysis literature were 
seen by the Bayesian computational community as 
techniques for specialized problems. However, by the 
early 1990s MCMC-based approaches have become 
a mainstay in computational Bayesian inference. The 
purpose of this paper is to review the events that led 
to this remarkable development. In particular, we 
examine the critical decade of 1980-1990 when the 
ideas of Markov chain simulation from the statistical 
physics literature and the latent variable formula- 
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tion in maximum likelihood computation (i.e., EM 
algorithm) came together to spark the widespread 
application of MCMC methods in Bayesian compu- 
tation. 

2. SOME PRE HISTORY 

2.1 Markov Chain Monte Carlo 

The origin of MCMC can be traced to the early 
1950s when physicists were faced with the need to 
numerically study the properties of many particle 
systems. The state of the system is represented by 
a vector x = (xi,X2, . . . , x n ), where X{ is the coordi- 
nate of the ith particle in the system and the goal is 
to study properties such as pressure and kinetic en- 
ergy, which can be obtained from computation of the 
averaged values of suitably defined functions of the 
state vector. The averaging is weighted with respect 
to the canonical weight exp(—E(x)/kT), where the 
constants k and T denote the Boltzmann constant 
and the temperature, respectively. The physics of 
the system is encoded in the form of the energy func- 
tion. For example, in a simple liquid model one has 
the energy E(x) = (1/2) ^ Yjtyj v (\ x i ~ x j\)> where 
V(-) is a potential function giving the dependence 
of pair-wise interaction energy on the distance be- 
tween two particles. Metropolis et al. (1953) intro- 
duce the first Markov chain Monte Carlo method 
in this context by making sequential moves of the 
state vector by changing one particle at a time. In 
each move, a random change of a particle is pro- 
posed, say, by changing to a position chosen within 
a fixed distance from its current position, and the 
proposed change is either accepted or rejected ac- 
cording to a randomized decision that depends on 
how much the energy of the system is changed by 
such a move. Metropolis et al. justified the method 
via the concepts of ergodicity and detailed balance 
as in kinetic theory. Although they did not explic- 
itly mention "Markov chain," it is easy to trans- 
late their formulation to the terminology of modern 
Markov chain theory. In subsequent development, 
this method was applied to a variety of physical 
systems such as magnetic spins, polymers, molecu- 
lar fluids and various condense matter systems (re- 
viewed in Binder, 1978), but all these applications 
share the characteristics that n is large 1 and the n 
components are homogeneous in the sense that each 



1 In the words of Geman and Geman (1984), "The Metropo- 
lis algorithm and others like it were invented to study the 



takes value in the same space (say, 6-dimensional 
phase space, or up/down spin space, etc.) and inter- 
acts in identical manner with other components ac- 
cording to the same physical law as specified by the 
energy function. These characteristics made it diffi- 
cult to recognize how the method can be of use in a 
typical Bayesian statistical inference problem where 
the form of the posterior distribution is very differ- 
ent from the Boltzmann distributions arising from 
physics. For this reason, although the probability 
and statistical community was aware of MCMC very 
early on (Hammersley and Handscomb, 1964 2 ) and 
had in fact made key contributions to its theoretical 
development (Hastings, 1970), the method was not 
applied 3 to Bayesian inference until the 1980s. 



equilibrium properties, especially ensemble averages, time- 
evolution and low-temperature behavior, of very large sys- 
tems of essentially identical, interacting components, such as 
molecules in a gas or atoms in binary alloys." 

interestingly, Hammersley and Handscomb (1964) present 
the discrete version of the Metropolis algorithm in the chap- 
ter entitled "Problems in statistical mechanics," rather than 
in the chapter on "Principles of the Monte Carlo method" 
which covers topics such as crude Monte Carlo, stratified sam- 
pling, importance sampling, control variates, antithetic vari- 
ates and regression methods. Similarly, in another popular 
textbook, Rubinstein (1981) presents the discrete version of 
the Metropolis algorithm toward the end of his book, embed- 
ded in an algorithm for 'global' optimization, rather than ear- 
lier in, for example, Chapter 3, "Random variate generation" 
or in Chapter 4, "Monte Carlo integration and variance re- 
duction techniques." Also, Ripley (1987) presents the discrete 
version of the Metropolis algorithm in the section "Metropo- 
lis' method and random fields" (but not in the previous chap- 
ter on simulating random variables or in the following chapter 
on Monte Carlo integration and importance sampling). 

3 Several authors have argued that this delay can be at- 
tributed to "lack of appropriate computing power" (quoting 
Robert and Casella, 2010). Kass (1997) in his J AS A review of 
the book Markov Cham Monte Carlo in Practice remarks on 
page 1645: "I believe that MCMC became important in statis- 
tics when it did for the simple reason that in the early 1990s 
large numbers of researchers could implement it on their desk- 
tops for interesting, nontrivial problems. . . I would suggest 
that the timing of their growth in popularity is explained pri- 
marily by computing technology." [This argument is quoted 
almost verbatim by Hitchcock (2003) — see Gubernatis (2003) 
who presents several "first-hand" accounts of the history of 
the Metropolis algorithm, as well as discusses why the al- 
gorithm received scant use in the 10-15 years following its 
development.] 

This response does not explain why in 1987, when research 
universities had long since transitioned away from the uni- 
versity mainframe (which billed for every second of CPU us- 
age, every page of memory, etc.) to departmental workstations 
(typically VAX 780's or 750's, which were essentially free to 
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2.2 Latent Variable Methods in Likelihood 
Inference: The EM Algorithm 

During the 1960s and 1970s, statisticians devel- 
oped an approach to maximum likelihood compu- 
tation that is quite effective in many popular sta- 
tistical models. The approach was based on the in- 
troduction of latent variables into the problem so 
as to make it feasible to compute the MLE if the 
latent variable value were known. Equivalently, if 
one regards the latent variable as "missing data", 
then this approach relies on the simplicity of infer- 
ence based on the "complete data" to design an it- 
erative algorithm to compute the maximum likeli- 
hood estimate and the associated standard errors. 
This development culminated in the publication of 
the extremely influential paper Dempster, Laird and 
Rubin (1977). A review of earlier research treating 
specific examples was presented in that paper, as 
well as the associated discussion. The high impact 
of Dempster, Laird and Rubin stems from its com- 
pelling demonstration that a wide variety of seem- 
ingly unrelated problems in standard statistical in- 



faculty), the key computational Bayesian methods advocated 
at the time were based on quadrature or Laplace approxima- 
tions or importance sampling (see the Appendix). Of course, 
one could not carry a VAX 750 around in a briefcase and it 
was not as powerful as current PCs, but one could log on from 
home or from the office and submit a job to run overnight (or 
over a weekend) without having to deal with a charge-back 
system. Clearly, the state of computing in the early 1980s did 
not detract from the work of Geman and Geman (1984). 

It may not have been a lack of computing power per se 
that contributed to this developmental delay, but possibly a 
concern regarding optimal algorithms. In this regard, Geman 
(1988a) notes: ". . . we view Monte Carlo optimization tech- 
niques as research tools. They are poor substitutes for the 
efficient dedicated algorithms that should be developed when 
facing applications involving a flow of data and a need for 
speedy analysis." Efron (1979) mentions that Tukey wanted 
to call the bootstrap the shotgun, because it "can blow off 
the head of any problem if the statistician can stand the re- 
sulting mess." The same can be said for Metropolis-Hastings 
algorithms, as they can solve any simulation problem, but as 
noted by Geman (1988a), there is a strong sense that one 
could find a more efficient approach to implementing the sim- 
ulation in the specific case. 

Robert and Casella (2010) also cite "lack of background 
on Markov chains." We question this as well, for by the mid- 
1970s it was standard practice to require a stochastic process 
course based on such texts as Karlin and Taylor (1975) as 
part of the degree requirements for a Ph.D. in Statistics. The 
treatment in Karlin and Taylor (1975), for example, provides 
a more than sufficient background to understand the related 
material in Hammersley and Handscomb (1964). 



ference, including multinomial sampling, normal lin- 
ear models with missing values, grouping and trun- 
cation, mixture problems and hierarchical models, 
can all be encompassed within this latent variable 
framework and thus become computationally fea- 
sible using the same algorithm (called the EM al- 
gorithm by Dempster, Laird and Rubin) for MLE 
inference. 

Because of its influence in later MCMC methods 
on the same set of problems, we briefly review a 
simplified formulation of the EM algorithm: Let y 
be the observed data vector, pe(y) be the density 
of y, and we are interested in the inference regard- 
ing 6. Two conditions are assumed for the applica- 
tion of the EM. First, it is assumed that although 
the likelihood L(6\y) = pe{y) may be hard to work 
with, one can introduce a latent (i.e., unobserved) 
variable z so that the likelihood L(6\y,z) = pe(y,z) 
based on the value of y and z becomes easy to op- 
timize as a function of 6. In fact, for simplicity, we 
assume that pe{y 1 z) is an exponential family distri- 
bution. The second condition is that for any fixed 
parameter value 6, it is possible to compute the ex- 
pectation of the sufficient statistics of the exponen- 
tial family, where the expectation is over z under 
the assumption that z is distributed according to its 
conditional distribution pe{z\y). We will see below 
that these conditions are closely related to the ones 
under which the most popular form of MCMC algo- 
rithm for Bayesian computation, namely, the Gibbs 
sampler, is applicable. 

3. EMERGENCE OF MCMC IN BAYESIAN 
COMPUTATION IN THE 1980S 

3.1 The State of Bayesian Computation 
in the Mid-1980s 

The early 1980s was an active period in the de- 
velopment of Bayesian computational methods. In 
addition to the traditional approach that relied on 
the use of conjugate priors to obtain analytically 
tractable posterior distributions, significant progress 
was made in numerical approximations to the pos- 
terior distribution. We now briefly review the main 
approaches. 

In many problems it is easy to evaluate the joint 
posterior density up to a constant of proportional- 
ity. The difficulty is to obtain posterior moments 
and marginal distributions of selected parameters 
of interest. Numerical integration methods were de- 
veloped to obtain these quantities from the joint 
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posterior. In particular, Naylor and Smith (1982) 
and Smith et al. (1985) advocated the use of Gaus- 
sian quadrature which would be the correct choice in 
large sample situations when the posterior is approx- 
imately normal. Alternatively, Kloek and van Dijk 
(1978, 1980) proposed the use of importance sam- 
pling to carry out the integration, and applied the 
method systematically in the context of simultane- 
ous equation models. Many novel variations were ex- 
perimented with in both approaches, including the 
important idea of adaptation where a preliminary 
integration was used to guide the choice of param- 
eters (grid points, importance function, etc.) in a 
second round of integration. 

Another influential work was Tierney and Kadane 
(1986) which uses the technique of Laplace approxi- 
mation to obtain accurate approximations for poste- 
rior moments and marginal densities; albeit in con- 
trast to the other approaches, the accuracy of this 
approximation is determined by the sample size and 
not under the control of the Bayesian analyst. 

These efforts demonstrated that accurate numer- 
ical approximation to marginal inference can be ob- 
tained in problems with moderate dimensional pa- 
rameter space [e.g., Smith et al. (1985) report suc- 
cess on problems with up to 6 dimensions] and cre- 
ated a great deal of excitement in the prospect of 
computational Bayesian inference (see, e.g., Zellner, 
1988). On the other hand, a review of the writings 
of leading Bayesian statisticians in this period re- 
veal no awareness of the promise of the MCMC 
approach that would soon emerge as a dominate 
tool in Bayesian computation. In fact, among more 
dogmatic Bayesians, 4 the use of Monte Carlo was 
met with resistance and viewed as antithetical to 
Bayesian principles. 

We are now ready to discuss the specific develop- 
ments that sparked the emergence of MCMC method- 
ology in statistics. 

3.2 Formulation of the Gibbs Sampler 

In 1984, Geman and Geman published a paper on 
the topic of Bayesian image analysis (Geman and 
Geman, 1984). Beyond its immediate and large im- 
pact in image analysis, this paper is significant for 
several results of more general interest, including a 



4 For a more detailed review of the prevailing approaches 
and views on Bayesian computation through the late 1980s, 
see the Appendix. 



proof of the convergence of simulated annealing, and 
the introduction of the Gibbs sampler. 

We briefly review how the Gibbs sampler emerged 
in this context. The authors began by drawing an 
analogy between images and statistical mechanics 
systems. Pixel gray levels and edge elements were 
regarded as random variables and an energy func- 
tion based on local characteristics of the image was 
used to represent prior information on the image 
such as piece- wise smoothness. Because interaction 
energy terms involved only local neighbors, the con- 
ditional distribution of a variable given the remain- 
ing components of the image depends only on its lo- 
cal neighbors, and is therefore easy to sample from. 
Such a distribution, for the systems of image pixels, 
is similar to the canonical distribution in statistical 
mechanics studied by Boltzmann and Gibbs, and it 
is thus called a Gibbs distribution for the image. 

Next, the authors analyzed the statistical problem 
of how to restore the image from an observed image 
which is a degradation of true image through the 
processes of local blurring and noise contamination. 
They showed that the posterior distribution of true 
image given the observed data is also a Gibbs distri- 
bution whose energy function still involves only local 
interactions. Geman and Geman proposed to gener- 
ate images from this posterior distribution by itera- 
tively sampling each image element from its condi- 
tional distribution given the rest of the image, which 
is easy to do because the distribution is still Gibbs. 
They call this iterative conditional sampling algo- 
rithm the Gibbs sampler. For the history of Bayesian 
computation, this was a pivotal step — although sim- 
ilar algorithms were already in use in the physics lit- 
erature; to our knowledge, this work represented the 
first proposal to use MCMC to simulate from a pos- 
terior distribution. On the other hand, because the 
Gibbs model for images is so similar to the (highly 
specialized) statistical physics models, it was not ap- 
parent that this approach could be effective in tra- 
ditional statistical models (see the Appendix). 

3.3 Introduction of Latent Variables and 
Collective Moves 

The use of iterative sampling for Bayesian infer- 
ence in traditional statistical models was first demon- 
strated in Tanner and Wong (1987). The problems 
treated in this work, such as normal covariance esti- 
mation with missing data, latent class models, etc., 
were of the type familiar to mainstream statisti- 
cians of the time. A characteristic of many of these 
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problems was that the likelihood is hard to compute 
(thus not amenable to MCMC directly). To perform 
Bayesian analysis on these models, the authors em- 
bedded them in the setting of the EM algorithm 
where a latent variable z can be introduced to sim- 
plify the inference of the parameter 9. They started 
from the equations 

(3.1) p(6\y) = J p(6\y,z)p(z\y)dz, 

(3.2) p(z\y) = Jp(z\9,y)p(9\y)d6. 

Recall that the conditions needed for the EM to 
work well are that pe{y, z) is simple to work with as 
a function of 6, and po(z\y) is easy to work with as 
a function of z. The first condition usually implies 
that the complete data posterior p(9\y, z) is also easy 
to work with. Thus, (3.1) can be approximated as 
a mixture of p(9\y,z) over a set of values (mixture 
values) for z drawn from (3.2). Similarly, (3.2) is 
approximated as a mixture of p(z\9,y) over mixture 
values for 9 drawn from (3.1). This led the authors 
to propose an iterated sampling scheme to construct 
approximations to p{9\y) andp{z\y) simultaneously. 
In each step of the iteration, one draws a sample of 
values with replacement from the mixture values for 
z (or 9), and then conditional on each such z, draws 
9 (or z) from p(9\y, z) [or p{z\9, y)\. 

This computation is almost identical to a version 
of the Gibbs sampler that iterates between the sam- 
pling of p(9\y,z) and p(z\9,y). In fact, if the sam- 
pling from the mixture values for z (or 9) was done 
without replacement rather than with replacement, 
as suggested by Morris (1987), then one would have 
exactly a population of independently run Gibbs 
samplers. The authors also noted 5 the connection to 
the equilibrium distribution of a Markov chain, but 
they did not employ it as the main mathematical 
framework in their analysis. In any case, a promi- 
nent aspect of its relevance lies in the explicit in- 
troduction of the latent variable z, which may or 



J Quoting Tanner and Wong (1987): "To see this, consider 
the extreme case in which m = 1, so that iteration i pro- 
duces only one value 9(i). In this case, 9(i) (i = 1,2,.. .) forms 
a Markov process with transition function equal to K(9,(j)), 
as defined in (2.3). Under the regularity conditions of Sec- 
tion 6, this is an ergodic Markov process with an equilib- 
rium distribution satisfying the fixed point equation given in 
(2.3)." Equation (2.3) presents the Markov transition func- 
tion K(9, <j>) (see also their Markov transition operator discus- 
sion in Remark 4) via the expression g(9) — j K{9, 4>)g{4>) d<t>> 
where K(0, (j>) = / p(9\z, y)p{z\(f>, y) dz. 



may not be part of the data vector or the param- 
eter vector of the original statistical model, to cre- 
ate an iterative sampling scheme for the Bayesian 
inference of the original parameter 9. Tanner and 
Wong referred to this aspect of the design of the al- 
gorithm as "data augmentation." A judicious choice 
of latent variables can allow one to sample from the 
posterior p(9\y) in cases where direct MCMC meth- 
ods, including the Gibbs sampler, may not even be 
applicable because of difficulty in evaluating p(9\y). 

As a discussant of Tanner and Wong (1987), Mor- 
ris (1987) makes several key observations of great 
relevance to MCMC Bayesian computing. In addi- 
tion to suggesting a version of data augmentation 
that is the same as parallel Gibbs sampling, he em- 
phasizes that (just as in the EM context) the aug- 
mentation is not limited to missing data, but can 
be done with parameters as well: "and to empha- 
size that their 'missing data' concept can be used 
to include unknown parameters or latent data." As 
an illustration of the data augmentation algorithm, 
Morris (1987) presents what we would now call the 
Gibbs sampler for a three-stage hierarchical model 
with k + 1 parameters. At the first level of his model, 
yi\9i are distributed independently as N(9i,Vi), for 
i = 1, . . . ,k (Vi known). At the second stage, 9i\A 
are i.i.d. A r (0, A), i = 1, . . . , k. At the final stage, A 
is distributed completely known distribution. 

Morris then says, "Let initial values A^ , . . . , A^ 
be given. The posterior distribution of 9 given (y, A) 

(i) 

is normally distributed and the P step samples 9\ ~ 
N((l-B®) Vi , Vi(l-B®)) for % = 1, . .. , k, j = 1, . . . , 
m, independently, with b\ j) = Vi/(Vi + A^) 
For the A parameter he writes, "The I step (1.5), 
therefore, samples new values A^ , . . . , A^ accord- 
ing to A® ~ (A + ||^|| 2 )/xl +g for j = l,...,m, 
with xl+g sampled independently for each j, \\9\\ 2 
denoting the sum of squares." 

Although no new theory beyond MCMC is needed 
for the analysis of a sampling algorithm designed to 
include latent variables, the nature of the resulting 
process may be drastically different from the tradi- 
tional MCMC processes, even in cases when the pos- 
terior p(9\y) is computable and therefore amenable 
to direct MCMC analysis. Consider, for example, a 
linear model y = xf3 + e, where the errors are inde- 
pendently distributed according to Student's t with 
a fixed degree of freedom. In this case the joint pos- 
terior density for f3 is computable and one can apply 
the standard Metropolis sampler to sample from it, 
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by iteratively proposing to change the (3 vector, one 
component at a time. However, the sampler may 
be easily trapped at local maxima. What is worse, 
the moves may be exceedingly localized (therefore 
slow) if there is serious collinearity. On the other 
hand, by regarding the error as a gamma-mixture of 
normal variables, one can condition on the gamma 
variables and generate the whole vector f3 (i.e., a 
collective move), which allows large moves even in 
the presence of collinearity. 6 

Interestingly, at about the same time, Swendsen 
and Wang (1987) also introduced the use of latent 
variables (called auxiliary variables) in the setting of 
a statistical mechanics system. This work deals with 
the Potts model of spins on a lattice. The authors 
introduced bond variables between spins and then 
alternated between the sampling of the two types of 
variables, spins and bonds. By conditioning on the 
bonds, they were able to make more global changes 
of the spin configuration by simultaneously updat- 
ing a whole cluster of spins that are connected by 
active bonds (i.e., a collective move). In this way 
they were able to dramatically reduce the correla- 
tion time of the resulting Markov process for sim- 
ulating a two-dimensional Ising model. Justifiably, 
this work is widely regarded as a breakthrough in 
dynamic Monte Carlo methods in statistical physics. 

3.4 A Synthesis 

Above we described how MCMC Bayesian com- 
putation arose in the 1980s from two independent 
sources, the statistical physics heritage as represented 
by Geman and Geman (1984), and the EM heritage 
as represented by Tanner and Wong (1987). A syn- 
thesis of these two traditions occurred in the im- 
portant work of Gelfand and Smith (1990). Like the 
former, they employed the Gibbs sampling version 
of MCMC. Like the latter, they focused on tradi- 
tional statistical models and relied on the use of la- 
tent variables to create iterative sampling schemes. 
Their paper 7 provided many examples to illustrate 



For simplicity, the errors are unsealed in our example. If 
the errors have a scaling parameter, then it will have corre- 
lation with the gamma variables and this will slow down the 
convergence. More advanced augmentation schemes can be 
created to break this correlation and accelerate convergence; 
see Liu, Rubin and Wu (1998), Liu and Wu (1999) and van 
Dyk and Meng (2001). 

7 Also of note are the papers of Li (1988) who used a multi- 
component Gibbs sampler to perform multiple imputation 
from the posterior, Gelman and King (1990) who employed 



the ease of use and effectiveness of iterative sam- 
pling, and clarified the relation between the data 
augmentation algorithm and the Gibbs sampler. 

The framing of data augmentation as MCMC also 
raised some new and interesting theoretical issues 
in the analysis of the MCMC output. For exam- 
ple, it follows from (3.2) that in data augmentation 
the estimate of an expectation E(g(9)\y) is given by 
- X^ILi E{g{6)\y, Zi), where the z^s are the currently 
sampled values for the latent variable z. Gelfand and 
Smith refer to the use of this estimate, instead of the 
usual estimate - J2i=i ff(^i)) as Rao-Blackwellization. 
They reasoned that if the Zi's are independently 
drawn, as in a final iteration of the data augmen- 
tation algorithm, then clearly Rao-Blackwellization 
will reduce estimation error. They did not analyze 
the situation when the samples are dependent, as 
when the samples are generated from the Gibbs sam- 
pling process. The superiority of the Rao-Blackwelli- 
zed estimator in the two-component Gibbs sampler 
was later established in Liu, Wong and Kong (1994) — 
see also Geyer (1995). 

After the publication of Gelfand and Smith's influ- 
ential paper, many mainstream statisticians began 
to adapt the use of MCMC in their own research, 
and the results in these early applications quickly 
established MCMC as a dominant methodology in 
Bayesian computation. However, it should be noted 
that in any given problem there could be a great 
many ways to formulate a MCMC sampler. In sim- 
ulating an Ising model, for example, one can try 
to flip each spin conditional on the rest, or flip a 
whole set of spins connected by (artificially intro- 
duced) bonds that are sampled alternatively with 



MCMC to analyze hierarchical models of voting data across 
districts and Spiegelhalter and Lauritzen (1990) who treated 
graphical models. The latter appeared to have inspired the 
popular MCMC software BUGS (Lunn et al., 2009). 

8 Smith (1991) notes that regarding the Gibbs sampler: 
"Substantial iterative computation is then required, but the 
need for sophisticated numerical understanding on the part of 
the statistical analyst is obviated." Gelfand et al. (1990) also 
comment that sampling methods have a "hidden" efficiency: 
"In addition, in the case of most of the more sophisticated 
techniques, substantial fresh effort is required (including, in 
some cases, beginning the analysis anew) if the focus of in- 
ferential interest changes. ..." Tanner and Wong (1987, pages 
533 and 549), in their normal variance-covariance matrix ex- 
ample, make the same point when they note the ease in which 
one may examine the posterior distribution of any function of 
the variance-covariance matrix, such as the smallest eigen- 
value (see their Figure 1 in the Rejoinder). 
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the spins. The effectiveness of the Swendsen and 
Wang (1987) algorithm in the Ising model does not 
simply stem from the fact that it is a Gibbs sampler, 
but rather depends critically on the clever design of 
the specific form of the sampler. Likewise, a large 
part of the success of MCMC in the early 1990s was 
based on versions of Gibbs samplers that were de- 
signed to exploit the special structure of statistical 
problems in the style of the EM and data augmen- 
tation algorithms. Thus, the emergence of MCMC 
in mainstream Bayesian inference has depended as 
much on the introduction of the mathematically el- 
egant MCMC formalism as on the realization that 
the structure of many common statistical models 
can be fruitfully exploited to design versions of the 
algorithm that are feasible and effective for these 
models. 

The appearance of Gelfand and Smith (1990) mar- 
ked the end of the emergence of the MCMC ap- 
proach to the study of posterior distributions, and 
the beginning of an exciting period, lasting to this 
day, of the application of this approach to a vast ar- 
ray of problems, including inference in non-parametric 
problems. Advanced techniques have also been de- 
veloped in this framework to accelerate convergence. 
Statisticians, no longer laggards in MCMC method- 
ology, now rival physicists in the advancement of 
MCMC methodology. 9 It is our hope that this pa- 
per will serve as a useful historical context to un- 
derstand current developments. 

APPENDIX 

In this appendix we present three key resources 
that define the state-of-the-art Bayesian computing 
as of the mid to late 1980s. 

A.l Smith et al.(1985) 

A key reference that catalogs the tools in the Baye- 
sian's armamentarium as of 1985 is the paper by 
Adrian F. M. Smith and colleagues entitled "The 
implementation of the Bayesian paradigm" (see Smith 
et al., 1985). After providing an overview of the goals 
of Bayesian computation, the authors critically re- 
view a number of implementation strategies: (1) ex- 
act analytic implementation based on conjugate pri- 
ors; (2) large sample normal approximation; (3) al- 
ternative asymptotic approximations including dis- 
cussion of a preprint of Tierney and Kadane (1986) 



9 For example, Geyer (1991) introduced parallel temper- 
ing ahead of related concepts in the physics literature — see 
Hukushima and Nemoto (1996). 



outlining the Laplace approach; (4) Monte Carlo 
integration — specifically importance sampling; (5) 
quadrature methodology — specifically the approach 
of Naylor and Smith (1982) based on the Gauss- 
Hermite product rules; and (6) successive transfor- 
mations of the parameters of the model to achieve 
normality. The remainder of Smith et al. (1985) re- 
views in considerable detail the Naylor and Smith 
(1982) strategy based on Gaussian quadrature, as 
well as presents several interesting examples. The 
authors conclude with the following remark: "novel 
numerical integration techniques together with effi- 
cient graphical procedures are now making Bayesian 
analysis practical for a wide range of problems." 

A. 2 The 1987 Special Issue of JRSS D 
(The Statistician) 

In 1987, the Journal of the Royal Statistical Soci- 
ety Series D (The Statistician) published a special 
issue entitled "Practical Bayesian statistics," edited 
by Gopal K. Kanji with technical editors Adrian F. 
M. Smith and A. P. Dawid. Following up on the 
report of Smith et al. (1985), Smith et al. (1987) 
discuss an adaptive approach where Gauss-Hermite 
quadrature methods are combined with parameter 
transformations in an iterative manner, namely, suc- 
cessive transformations are determined by the esti- 
mated variance-covariance matrix of the previous 
iteration. Also discussed in this paper is an iterative 
importance sampling strategy, where the informa- 
tion in the previous iteration is used to improve the 
importance function for the transformed parame- 
ters. Smith et al. (1987) suggest using "quasi-randorh 
sequences on the /c-dimensional hypercube and ref- 
erence the preprint of Shaw (1988a). 

The paper by Smith et al. (1987) sets the com- 
putational theme of the issue, as many of the other 
papers make use of either Gauss-Hermite quadra- 
ture methodology or importance sampling, van Dijk, 
Hop and Louter (1987) present the details of an al- 
gorithm for the computation of posterior moments 
and densities based on importance sampling, specif- 
ically that the importance sampling function can be 
adapted based on the output of the previous itera- 
tion. In van Dijk, Hop and Louter (1987), the poste- 
rior mean and the posterior covariance matrix based 
on the output of the previous iteration is used to 
update the parameters of the multivariate Student 
t importance function. 

Stewart (1987) illustrates how (nonadaptive) im- 
portance sampling can be used in the context of 
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hierarchical Bayesian models. Grieve (1987), Mar- 
riott (1987), Naylor (1987) and Shaw (1987) over- 
come the analytic intractability of the posterior us- 
ing the adaptive Gauss-Hermite integration strat- 
egy of Naylor and Smith (1982). Van Der Merwe and 
Groenewald (1987) approximate the posterior dis- 
tribution with a Pearson distribution, while Achcar, 
Bolfarine and Pericchi (1987) make use of the Laplace 
approximation discussed in Tierney and Kadane 
(1986). Spiegelhalter (1987), in his treatment of ev- 
idence propagation in expert systems, briefly men- 
tions stochastic relaxation (Geman, 1988b) as a pos- 
sible technique but does not use it in his treatment. 10 
Finally, O'Hagan (1987) expressed strong objection 
to using Monte Carlo methods in Bayesian inference- 
even going as far as giving his paper the title "Monte 
Carlo is fundamentally unsound". 

A. 3 The Third Valencia International Meeting: 
June 1-5, 1987 

The Third Valencia International Meeting on Baye- 
sian Statistics was held on June 1-5, 1987. Accord- 
ing to the Preface of the Proceedings (see Bernardo 
et al., 1988), the scientific program consisted of 31 
invited papers, each with discussion, and 33 refer- 
eed contributed papers: "The selection of topics, au- 
thors and discussants ensures that those Proceed- 
ings provide a definitive up-to-date overview of cur- 
rent concerns and activity in Bayesian Statistics, en- 
compassing a wide range of theoretical and applied 
research." 

As was seen in the paper by Smith et al. (1985), 
as well as the special issue of The Statistician, the 
key computational approaches vying for contention 
in this Proceedings are approximations based on 
Laplace's method, Gaussian quadrature numerical 
integration — possibly implemented in an iterative 
manner, and importance sampling — possibly embel- 
lished with an adaptive procedure to update the im- 
portance function. A survey of available software for 
Bayesian analysis was presented by Goel (1988) and 
this paper was discussed by van Dijk (1988). In the 
body of the discussion, van Dijk (1988) proposes 
a decision tree to determine which computational 
technique is most suitable for the problem at hand. 
He notes that if the posterior is "reasonably well be- 
haved in the sense that it is unimodal, continuous, 
proper, not too skewed," then a Laplace approxima- 
tion approach or possibly an importance sampling 



See also Pearl (1987). 



approach using a Student t importance function, 
as discussed in van Dijk, Hop and Louter (1987), 
should be used. If a transformation of the param- 
eters results in a "more regular shape of the pos- 
terior," then he suggests Naylor and Smith (1982). 
If the posterior distribution is unimodal, but not 
much more is known, then van Dijk suggests im- 
portance sampling with adaptive importance func- 
tions. In this regard, van Dijk (1988) references the 
preprint of Geweke (1989) who also advocates for 
importance sampling and who remarks, "Integration 
by Monte Carlo is an attractive research tool be- 
cause it makes numerical problems much more rou- 
tine than do other numerical integration methods." 

A second paper on Bayesian software was pre- 
sented by Smith (1988). In the first sentence of the 
paper's abstract, he notes, "Recent developments in 
methods of numerical integration and approxima- 
tion, in conjunction with hardware trends towards 
the widespread availability of single-user worksta- 
tions which combine floating-point arithmetic power 
with sophisticated graphics facilities in an integrated 
interactive environment, would seem to have removed 
whatever excuses were hitherto historically avail- 
able for the lack of any generally available form 
of Bayesian software." Smith argues that given ad- 
vances in algorithm development, as well as the move- 
ment from large mainframes to workstations, the 
time is ripe for the development of Bayesian soft- 
ware. Smith points out on page 433 that the com- 
putational tools he has in mind are approximations 
based on asymptotic expansions, numerical integra- 
tion methods based on quadrature and importance 
sampling methodology. In discussing the state of the 
field, Zellner (1988) notes, ". . . it is concluded that 
a Bayesian era in econometrics and statistics has 
emerged." 

DuMouchel (1988), Albert (1988), Poirier (1988) 
and Sweeting (1988) employ or suggest the use of 
the Laplace approximation referencing Tierney and 
Kadane (1986) or related references. Morris (1988) 
proposes an approach based on the Pearson fam- 
ily which can be used to generalize the method of 
Laplace. Kass, Tierney and Kadane (1988) discuss 
the Laplace approximation in detail, extending the 
theory, providing examples and discussing imple- 
mentation in the S computing environment. 

Kim and Schervish (1988) make extensive use of 
the Gauss-Hermite approach as presented in Smith 
et al. (1985) and the discussants of this paper sug- 
gest the use of spherical integration rules (as imple- 
mented in Bayes 4)- Shaw (1988b) discusses several 
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approaches to numerical integration, with emphasis 
on Gauss-Hermite and importance sampling with 
quasi-random sequences (see Shaw, 1987, 1988a). 
Grieve (1988) uses Gauss-Hermite in the analysis of 
LD50 experiments, Marriott (1988) uses these meth- 
ods (referencing Bayes 4 ) in the context of ARMA 
time series models and Pole (1988) in the context 
of state-space models. Schnatter (1988) uses gen- 
eralized Laguerre integration for forecasting AR(p) 
time series models. 

Rubin (1988) presents an overview 11 of his im- 
portance sampling based algorithm SIR, which he 
proposes as a general approach for posterior simu- 
lation, distinguishing it from the iterative MCMC 
approach in Tanner and Wong (1987), noting, "The 
SIR (Sampling/Importance Resampling) algorithm 
is an ubiquitously applicable noniterative algorithm 
for obtaining draws from an awkward distribution: 
M draws from an initial approximation are made, 
and then m < M draws are made from these with 
probability approximately proportional to their im- 
portance ratios." 

A noted exception to the Laplace/numerical inte- 
gration/importance sampling approach to Bayesian 
computing is the paper by Geman (1988a). In Sec- 
tion 2.5 on computing, Geman very clearly points 
out the basic idea behind MCMC methods, "Dy- 
namics are simulated by producing a Markov chain, 
X(1),X(2), . . . with transition probabilities chosen 
so that the equilibrium distribution is the posterior 
(Gibbs) distribution (2.4). One way to do this is with 
the Metropolis algorithm (Metropolis et al., 1953). 
More convenient for image processing is a variation 
we call stochastic relaxation." He then proceeds to 
present what we now refer to as the full condition- 
als for the Gibbs sampler in the context of his prob- 
lem. However, neither in the discussion nor in the 
response is there evidence to suggest that anyone at 
the time envisioned the MCMC methodology pre- 
sented in Geman (1988a) having a broader impact 
to general parametric statistical problems. If any- 
thing, the methodology seems pigeon-holed as tech- 
niques for image analysis. Examining the Preface of 
the Proceedings, we find that the paper is summa- 
rized as "The important area of image-processing is 
reviewed by Geman." In fact, Geman appeared to 
have doubts about the practical utility of MCMC 
methods in his remark on page 169: "On the other 



The presentation of this algorithm comprised the bulk of 
his discussion of Tanner and Wong (1987). 



hand, it is indeed difficult to find approaches that 
are as computationally expensive as ours. In this re- 
gard, we view Monte Carlo optimization techniques 
as research tools. They are poor substitutes for the 
efficient dedicated algorithms that should be devel- 
oped when facing applications involving a flow of 
data and a need for speedy analysis." Added to this 
is his comment found on page 171 when referring to 
the algorithm in Besag (1986): "Furthermore, this 
deterministic algorithm is typically far more efficient 
than stochastic relaxation methods." 
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